18 research outputs found

    Exploring different representational units in English-to-Turkish statistical machine translation

    We investigate different representational granularities for sub-lexical representation in statistical machine translation from English to Turkish. We find that (i) representing both Turkish and English at the morpheme level but with some selective morpheme grouping on the Turkish side of the training data, (ii) augmenting the training data with “sentences” comprising only the content words of the original training data to bias root word alignment, (iii) reranking the n-best morpheme-sequence outputs of the decoder with a word-based language model, and (iv) using model iteration all provide a non-trivial improvement over a fully word-based baseline. Despite our very limited training data, we improve from 20.22 BLEU points for our simplest model to 25.08 BLEU points, an improvement of 4.86 points or 24% relative.
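
The content-word augmentation in step (ii) can be pictured with a small sketch. This is only an illustration under stated assumptions: the tag names, the choice of noun/verb/adjective/adverb as the content-word classes, and the helper names `content_only` and `augment` are hypothetical and do not come from the paper.

```python
# Assumed tag set marking content words; the abstract only says "content words".
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def content_only(tagged_sentence):
    """Keep only the tokens whose tag marks them as content words."""
    return [token for token, tag in tagged_sentence if tag in CONTENT_TAGS]


def augment(parallel_corpus):
    """Return the original sentence pairs followed by content-word-only pairs.

    parallel_corpus is a list of (source, target) pairs, where each side is a
    list of (token, tag) tuples.
    """
    augmented = []
    # First, the original pairs with the tags stripped.
    for source, target in parallel_corpus:
        augmented.append(([t for t, _ in source], [t for t, _ in target]))
    # Then, the extra "sentences" made of content words only.
    for source, target in parallel_corpus:
        source_content = content_only(source)
        target_content = content_only(target)
        if source_content and target_content:
            augmented.append((source_content, target_content))
    return augmented


# Toy, hand-tagged example pair (illustrative data only).
toy_corpus = [
    (
        [("the", "DET"), ("cat", "NOUN"), ("is", "AUX"), ("sleeping", "VERB")],
        [("kedi", "NOUN"), ("uyuyor", "VERB")],
    )
]

for source_tokens, target_tokens in augment(toy_corpus):
    print(" ".join(source_tokens), "|||", " ".join(target_tokens))
```

The extra pairs contain no function words, so an aligner trained on the augmented corpus sees the root words co-occur more often, which is the bias toward root word alignment that the abstract describes.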

    Initial explorations in English to Turkish statistical machine translation

    This paper presents some very preliminary results for, and problems in, developing a statistical machine translation system from English to Turkish. Starting with a baseline word model trained from about 20K aligned sentences, we explore various ways of exploiting morphological structure to improve upon the baseline system. As Turkish is a language with complex agglutinative word structures, we experiment with morphologically segmented and disambiguated versions of the parallel texts in order to also uncover relations between morphemes and function words in one language and morphemes and function words in the other, in addition to relations between open-class content words. Morphological segmentation on the Turkish side also conflates the statistics from allomorphs so that sparseness can be alleviated to a certain extent. We find that this approach, coupled with a simple grouping of the most frequent morphemes and function words on both sides, improves the BLEU score from the baseline of 0.0752 to 0.0913 with the small training data. We close with a discussion of why one should not expect distortion parameters to model word-local morpheme ordering and why a new approach to handling complex morphotactics is needed.

    A statistical machine translation system for Turkish-English (Türkçe-İngilizce için istatistiksel bilgisayarlı çeviri sistemi)

    This paper describes a statistical machine translation system for the Turkish-English language pair. Problems arising from the structural differences between the two languages are eliminated by performing morphological analysis and representing suffixes separately. The approach is tested with a phrase-based decoder. System performance is improved by bigram-based grouping of suffixes. The proposed method yields better results than the standard model. Although the performance of the system, built from parallel texts of 22,000 sentences, is not satisfactory, it is a start.

    A prototype English-Turkish statistical machine translation system

    Translating one natural language (text or speech) to another natural language automatically is known as machine translation. Machine translation is one of the major, oldest and most active areas in natural language processing. The last decade and a half have seen the rise of statistical approaches to the problem of machine translation. Statistical approaches learn translation parameters automatically from aligned text instead of relying on hand-written rules, which are labor intensive to produce. Although there has been quite extensive work in this area for some language pairs, there has not been research for the Turkish-English language pair. In this thesis, we present the results of our investigation and development of a state-of-the-art statistical machine translation prototype from English to Turkish. Developing an English to Turkish statistical machine translation prototype is an interesting problem from a number of perspectives. The most important challenge is that English and Turkish are typologically rather distant languages. While English has very limited morphology and rather fixed Subject-Verb-Object constituent order, Turkish is an agglutinative language with very flexible (but Subject-Object-Verb dominant) constituent order and a very rich and productive derivational and inflectional morphology, with word structures that can correspond to complete phrases of several words in English when translated. Our research is focused on making scientific contributions to the state of the art by taking into account certain morphological properties of Turkish (and possibly similar languages) that have not been addressed sufficiently in previous research for other languages. In this thesis, we investigate how different morpheme-level representations of morphology on both the English and the Turkish sides impact statistical translation results. We experiment with local word reordering on the English side to bring the word order of specific English prepositional phrases and auxiliary verb complexes in line with the corresponding case-marked noun forms and complex verb forms on the Turkish side, to help with word alignment. To alleviate the dearth of available parallel data, we augment the training data with sentences consisting only of content words (noun, verb, adjective, adverb) obtained from the original training data and with highly reliable phrase pairs obtained iteratively from an earlier phrase alignment. We use a word-based language model in the reranking of the n-best lists, in addition to the morpheme-based language model used for decoding, so that we can incorporate both the local morphotactic constraints and local word ordering constraints. Lastly, we present a procedure for repairing the decoder output by correcting words that have incorrect morphological structure or are out-of-vocabulary with respect to the training data and language model, to further improve the translations. We also include fine-grained evaluation results and some oracle scores obtained with the BLEU+ tool, which is an extension of the evaluation metric BLEU. After all research and development, we improve from 19.77 BLEU points for our word-based baseline model to 27.60 BLEU points, an improvement of 7.83 points or about 40% relative.

    Retrieving words from their "meanings"

    The human brain is the best memory, able to record and retain a huge amount of information for a long time. Words, their meanings, domains, relationships between different words, and the grammars of languages are well organized in the linguistic component of the brain. While speaking or writing, we can generally express our thoughts and feelings in words without thinking for long about what the correct words might be. But sometimes things do not go like clockwork, even for the human brain. In daily life, we often fail to recall a word that we use frequently and whose meaning we know exactly. While writing a document, talking with friends, or solving a puzzle, we cannot remember which word to say or write. When we face this problem, searching a traditional dictionary for the word we cannot remember is of no use. In such cases, there is a need for resources that can locate the word from its meaning. This thesis presents the design and implementation of a Meaning to Word dictionary (MTW) that locates a set of Turkish words that most closely match the correct or appropriate one, based on a definition entered by the user. The approach of extracting words from "meanings" is based on checking the similarity between the user's definition and each entry of the Turkish dictionary without considering any semantic or grammatical information. MTW can be used in various application areas such as computer-assisted language learning, finding the correct words for definition questions in crossword puzzles, and searching for one-word representations or synonyms of multi-word definitions in a reverse dictionary. Results on unseen data indicate that our system returns the correct answer within the first 50 results for 72% of real users' queries and for 90% of queries taken from different dictionaries.

    Use of wordnet for retrieving words from their meanings

    No full text
    This paper presents a Meaning to Word (MTW) system for the Turkish language that finds a set of words closely matching the definition entered by the user. The approach of extracting words from meanings is based on checking the similarity between the user's definition and each entry of the Turkish database without considering any semantic or grammatical information. Results on unseen user queries indicate that in 66% of the queries the correct responses were in the first 50 of the words returned, while for queries selected from the word definitions in a different dictionary, the correct responses were in the first 50 of the words returned in 92% of the queries. Our system makes extensive use of various linguistic resources, including Turkish WordNet.
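
The matching idea described here, comparing the user's definition against every dictionary entry and returning the closest words, can be sketched as follows. The abstract does not say which similarity measure MTW uses, so the Jaccard word overlap below is a stand-in assumption; the function names and the tiny English dictionary are likewise hypothetical and only illustrate the general shape of such a lookup.

```python
def tokenize(text):
    """Lowercase and split on whitespace; no stemming or grammatical analysis."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two token sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_words(query, dictionary, top_k=50):
    """Return up to top_k headwords whose definitions best match the query."""
    query_tokens = tokenize(query)
    scored = [
        (jaccard(query_tokens, tokenize(definition)), word)
        for word, definition in dictionary.items()
    ]
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for score, word in scored[:top_k] if score > 0]


# Toy dictionary (illustrative entries, not from the Turkish database).
toy_dictionary = {
    "library": "a building where books are kept",
    "bakery": "a place where bread is baked and sold",
    "pen": "a tool used for writing with ink",
}

print(rank_words("place where books are kept", toy_dictionary))
```

The default of `top_k=50` mirrors the evaluation cut-off reported in the abstract, where a query counts as answered if the correct word appears among the first 50 words returned.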

    TÜBİTAK-BİLGEM German-English Machine Translation Systems for WMT'13

    No full text
    This paper describes the TÜBİTAK-BİLGEM statistical machine translation (SMT) systems submitted to the Eighth Workshop on Statistical Machine Translation (WMT) shared translation task for the German-English language pair in both directions. We implement phrase-based SMT systems with standard parameters. We present the results of using a large tuning data set and the effect of averaging the tuning weights obtained with different seeds. Additionally, we perform linguistically motivated compound splitting in the German-to-English SMT system.

    Exploiting morphology and local word reordering in English-to-Turkish phrase-based statistical machine translation

    No full text
    In this paper, we present the results of our work on the development of a phrase-based statistical machine translation prototype from English to Turkish, an agglutinative language with very productive inflectional and derivational morphology. We experiment with different morpheme-level representations for English-Turkish parallel texts. Additionally, to help with word alignment, we experiment with local word reordering on the English side, to bring the word order of specific English prepositional phrases and auxiliary verb complexes in line with the morpheme order of the corresponding case-marked nouns and complex verbs on the Turkish side. To alleviate the dearth of available parallel data, we also augment the training data with sentences consisting only of content word roots obtained from the original training data, to bias root word alignment, and with highly reliable phrase pairs from an earlier corpus alignment. We use a morpheme-based language model in decoding and a word-based language model in reranking the n-best lists generated by the decoder. Lastly, we present a scheme for repairing the decoder output by correcting words which have incorrect morphological structure or which are out-of-vocabulary with respect to the training data and language model, to further improve the translations. We improve from 15.53 BLEU points for our word-based baseline model to 25.17 BLEU points, an improvement of 9.64 points or about 62% relative.
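
The absolute and relative gains quoted in this abstract follow from simple arithmetic on the two BLEU scores. A small check, with a hypothetical helper name `bleu_gain`, might look like this:

```python
def bleu_gain(baseline, improved):
    """Return (absolute gain in BLEU points, relative gain in percent)."""
    absolute = improved - baseline
    relative = absolute / baseline * 100.0
    return round(absolute, 2), round(relative, 1)


# Scores reported in the abstract: baseline 15.53, final system 25.17.
absolute, relative = bleu_gain(15.53, 25.17)
print(f"+{absolute} BLEU points, {relative}% relative")
```

The relative figure is measured against the baseline score, which is why a gain of 9.64 points on a 15.53 baseline comes out at roughly 62%.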